---
title: "Replication Material for: How technological change affects regional voting patterns. Prepare data sent to IAB"
output: html_document
author: Nikolas Schöll and Thomas Kurer
editor_options: 
  chunk_output_type: inline
---

# Introduction

A key data channel of this paper was to obtain county-level (Kreise & kreisfreie Städte) information on employment shares in different industries.

This is required for the shift-share instrument we employ to estimate the regional distribution of robots, ICT and trade exposure. Furthermore we required on county-level distribution of occupation to estimate regional task distributions, skill-distribution and employment-by sector distributions.

Only the IAB, attached to the German Employment Agency (Bundesagentur für Arbeit), disposes of such data as they have access to all social security data.

To researchers, they offer a 2% sample of all individuals registered with social security in Germany and daily information on their employment (industry, occupation, skill-level, education, etc.). In particular, we use the SIAB data set (<https://www.iab.de/en/publikationen/weitere-publikationen/weitere-publikationen-details.aspx/Publikation/k190802304>).

For privacy reasons, the data can only be accessed on-site. Therefore, we had to prepare industry-level data on robots, ICT and trade and send it to the IAB before we could create a county-level within the system of the IAB which was then exported and made available to us to execute our analysis in R. Since the IAB server only works in STATA, we prepared necessary input data sets and crosswalks as STATA data sets (.dta). These files can be found in the folder "orig".

Furthermore, the scripts that create the sample on the servers of the IAB can be found in the folder "prog" (folder names mirror the structure on the IAB server).

# Setup

```{r}
pacman::p_load(tidyverse,
               magrittr,
               haven,
               readxl)
```

# 1. Robots data from IFR

IFR and IAB both use NACE rev.2 (equivalent to ISIC4 and Wirtschaftszweige 2008 in Germany) as industry classification. However, they use different granularity of industry classification. IAB uses 3-digit classification. For the IFR data it depends, in manufacturing it goes down till 4-digit classification whereas in other industries there is only information on the 2-digit level available. We define the largest common denominator between both data sources as "robots_aggregated". Therefore, we first aggregate both robots per industry/year and data from IFR and employment per region/industry/year to the level of robots_aggregated.

## Prepare crosswalk IAB (WZ08 = NACE rev.2) to robots_aggregated

For this we first manually coded a crosswalk between NACE rev.2 and robots_aggregated. We export this as two separate stata files to use them at the IAB.

```{r}
crosswalk = read_xlsx("../data/IFR/crosswalk_robots-aggregated_nace2.xlsx")

crosswalk %>% 
  filter(digits == 2) %>% 
  write_dta("orig/cw_robots-aggregated_nace2_2digits.dta")


crosswalk %>% 
  filter(digits == 3) %>% 
  write_dta("orig/cw_robots-aggregated_nace2_3digits.dta")
```

## IFR data to robots aggregated

Aggregate robots data from IFR to level of robots_aggregated and store as .dta file to upload to IAB server. The IFR data is not included in the replication files as it needs to be purchased.

```{r}
crosswalk = readxl::read_excel(path = "../data/IFR/crosswalk_IFR_robots-aggregated.xlsx", sheet = 1)

# Create data frame of robots for DE and EU, using the lowest level of aggregation possible
robots_list = list()
for (country_code in c("DE", "EU")){
  robots_list[[country_code]] =
    readxl::read_excel(paste0("../data/IFR/raw_data/industry_", country_code, "_operational.xlsx"),
                       skip = 2) %>% 
    mutate(IFR_code = lag(...1)) %>%  #code is shifted
    select(-...1,-...2) %>% 
    filter(IFR_code %in% crosswalk$IFR_code, #only use relevant categories
           !is.na(IFR_code)) %>% 
    left_join(crosswalk) %>% 
    # select(industry_to_merge_on, data_from, IFR_code, everything()) %>% 
    gather(key = "year", value = "robots", `1993`:`2017`) %>% 
      select(-IFR_code) %>% 
      group_by(robots_aggregated, Dauth_label, year) %>% 
      summarise_all(sum)
    
}

# calculate number of robots in the EU excluding Germany:
robots_list[["EU_non_DE"]] = 
  robots_list[["EU"]] 
robots_list[["EU_non_DE"]]$robots = robots_list[["EU"]]$robots - robots_list[["DE"]]$robots

# store
robots_list[["DE"]] %>% 
  write_dta(path = paste0("orig/robots_DE_operational.dta"))
robots_list[["EU_non_DE"]] %>% 
  write_dta(path = paste0("orig/robots_EU_non_DE_operational.dta"))

```

# 2. Trade

## Code to redownload data from Comtrade

Lists of country codes:

| country code | code    |
|--------------|---------|
| 276          | Germany |

: Reporter

| country code | country            |
|--------------|--------------------|
| 100          | Bulgaria           |
| 112          | Belarus            |
| 203          | Czechia            |
| 348          | Hungary            |
| 498          | Rep. of Moldova    |
| 616          | Poland             |
| 703          | Slovakia           |
| 804          | Ukraine            |
| 642          | Romania            |
| 643          | Russian Federation |
| 156          | China              |

: Partner

Code to download raw data from COMTRADE API. This step can be skipped and use downloaded data instead (see below).

```{r}
# pacman::p_load(tidyverse,
#                rjson, 
#                magrittr)
# get.Comtrade <- function(url="http://comtrade.un.org/api/get?"
#                          ,maxrec=50000
#                          ,type="C"
#                          ,freq="A"
#                          ,px="HS"
#                          ,ps="now"
#                          ,r
#                          ,p
#                          ,rg="all"
#                          ,cc="TOTAL"
#                          ,fmt="json"
# )
# {
#   string<- paste(url
#                  ,"max=",maxrec,"&" #maximum no. of records returned
#                  ,"type=",type,"&" #type of trade (c=commodities)
#                  ,"freq=",freq,"&" #frequency
#                  ,"px=",px,"&" #classification
#                  ,"ps=",ps,"&" #time period
#                  ,"r=",r,"&" #reporting area
#                  ,"p=",p,"&" #partner country
#                  ,"rg=",rg,"&" #trade flow
#                  ,"cc=",cc,"&" #classification code
#                  ,"fmt=",fmt        #Format
#                  ,sep = ""
#   )
#   
#   if(fmt == "csv") {
#     raw.data<- read.csv(string,header=TRUE)
#     return(list(validation=NULL, data=raw.data))
#   } else {
#     if(fmt == "json" ) {
#       raw.data<- fromJSON(file=string)
#       data<- raw.data$dataset
#       validation<- unlist(raw.data$validation, recursive=TRUE)
#       ndata<- NULL
#       if(length(data)> 0) {
#         var.names<- names(data[[1]])
#         data<- as.data.frame(t( sapply(data,rbind)))
#         ndata<- NULL
#         for(i in 1:ncol(data)){
#           data[sapply(data[,i],is.null),i]<- NA
#           ndata<- cbind(ndata, unlist(data[,i]))
#         }
#         ndata<- as.data.frame(ndata)
#         colnames(ndata)<- var.names
#       }
#       return(list(validation=validation,data =ndata))
#     }
#   }
# }
# 
# pacman::p_load(rjson)
# string <- "http://comtrade.un.org/data/cache/partnerAreas.json"
# country_codes <- fromJSON(file=string)
# country_codes = country_codes$results %>% t() %>% map(~as.tibble(.)) %>% bind_rows()


# reporters = c("276") #Germany
# partners = c(100, 112, 203, 348, 498, 616, 703, 804, 642, 643, 156) %>% as.character()
# years = 1990:2018 %>% as.character()
# combinations = expand.grid(reporters, partners, years) %>% nrow()
# #flows = list()
# 
# # sometimes crashes so keep retrying until all data downloaded:
# while(length(flows) < combinations){
#   for (reporter in reporters){
#     for (partner in partners){
#     
#       for (year in years){
#         name = paste(reporter, partner, year, sep = "_")
#         
#         if (name %in% names(flows)){
#           next()
#         }
#   1
#         if (name %in% c("842_112_1999", "842_112_2003", "842_112_2016")){
#           next()
#         }
#         cat(" ", name)
#         
#         response = NULL
#         try( {
#           response = get.Comtrade(r = reporter,
#                       p= partner,
#                       px = "S3", # classification SITC4
#                       ps = year, #year
#                       cc = "AG5", #level of disaggregation
#                       fmt="csv")
#           
#           response = response$data %>% 
#             select(sitc = Commodity.Code,
#              Commodity,
#              year = Year,
#              Reporter,
#              Partner,
#              direction = Trade.Flow,
#              value = Trade.Value..US..)
#           
#           if(nrow(response) > 0){
#             flows[[name]] = response
#           }
#         
#           
#           
#           print("Success!")
#           }, silent = FALSE)
#         # 
#   
#         if (is.null(response)){
#           print("error")
#           #Sys.sleep(60)
#           next()
#         }
#       }
#     }
#   }
# }
# flows %>% write_rds("trade_flow_list_5d.RDS")

```

## Process trade data

Here we rely on already downloaded data:

```{r}
trade = 
  read_rds("../data/COMTRADE/trade_flow_list_5d.RDS") %>% 
  bind_rows() %>%
  filter(Reporter == "Germany") %>% 
  mutate(Partner_region = ifelse(Partner == "China","China", "Eastern_Europe")) %>% 
  group_by(Reporter, Partner_region, year, direction, sitc, Commodity) %>% 
  summarise(value = sum(value),
            n = n())

```

## Convert to constant euros

-   data on exchange rates from EUROSTAT (<https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/ert_bil_eur_a.tsv.gz>),

-   CPI data from FRED (<https://fred.stlouisfed.org/series/CPHPTT01EZA661N>)

```{r}
exchange_rate = read.csv("../data/EUROSTAT/ert_bil_eur_a_1_Data.csv")
exchange_rate = exchange_rate %>% 
  filter(CURRENCY == "US dollar",
         STATINFO == "Average") %>% 
  select(year = TIME,
         usd_in_eur = Value) %>% 
  filter(year >= 1991) %>% 
  mutate(usd_in_eur = usd_in_eur %>% as.character() %>%  as.numeric())


cpi = read.csv("https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=off&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPHPTT01EZA661N&scale=left&cosd=1990-01-01&coed=2021-01-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Annual&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-03-05&revision_date=2022-03-05&nd=1990-01-01")%>% 
  mutate(year = DATE %>% str_sub(1,4) %>% as.integer()) %>% 
  select(year,
         cpi_2015 = CPHPTT01EZA661N)

# transform to constant euros
trade = trade %>% 
  left_join(exchange_rate) %>% 
  left_join(cpi) %>% 
  mutate(value = value / usd_in_eur / (cpi_2015/100))
```

## Merge on NACE rev.1 crosswalk

Next, we apply apply a crosswalk from SITC (products) to NACE (industries).

```{r}
#load crosswalk
crosswalk_sitc_to_nace = read.csv("../data/COMTRADE/JobID-74_Concordance_S3_to_NC.CSV", stringsAsFactors = F) %>% 
  select(sitc = SITC.Revision.3.Product.Code,
         nace = NACE.Revision.1.Product.Code,
         nace_description = NACE.Revision.1.Product.Description)

# reshape wide and merge on Nace codes
trade_wide = trade %>% 
  unite(trade_flow, Partner_region, direction, sep = "_") %>% 
  ungroup() %>% 
  select(-n) %>% 
         spread(key = trade_flow, value = value, fill = 0) %>% 
  left_join(crosswalk_sitc_to_nace)
```

## Store

Finally, we export the resulting

```{r}
trade_wide %>% 
  select(sitc, year, China_Export, China_Import, Eastern_Europe_Export, Eastern_Europe_Import, nace, nace_description) %>% 
  haven::write_dta("orig/trade_DE_china_eastern_europe.dta")
```

# 3. ICT

We use data from EUKLEMS to calculate the ICT capital stock per worker in each region. Data is by industry (NACE rev. 2). For some industries, it is on the 2-digit level, for others only on the one digit level. \## Create ICT and store to folder

```{r}
sample_list = 
  list("DE" = c("DE"),
       "EU_non_DE" = c("AT", "BE", "CZ", "DK", "FI", "FR", "IT", "NL", "UK"))


for (entry in names(sample_list)) {
  euklems_capital = read_rds("../data/EUKLEMS/Statistical_Capital_EUKLEMS_2019.rds") %>% 
  mutate(indnr = indnr %>% as.integer()) %>% 
    filter(country %in% sample_list[[entry]],
         indnr <= 38,
         !is.na(indnr)) %>% 
  group_by(year, indnr, var, code) %>% 
    summarise(value = sum(value)) %>% 
  ungroup()
  
  ICT = euklems_capital %>% 
  filter(var %in% c("Kq_IT", "Kq_CT", "Kq_Soft_DB"),
         !indnr %in% c("Agg", "*Agg")) %>% 
  group_by(indnr, code, year) %>% 
  summarise(ICT_total = sum(value)) 
  
  if(entry == "EU_non_DE"){
    ICT %<>%
      rename(ICT_eu_total = ICT_total)
  }
  
  ICT %>% haven::write_dta(paste0("EUKLEMS_", entry, ".dta"))
}
```

## Convert crosswalk to dta

This crosswalk will be applied to the IAB data to aggregate the employment shares by industry to the same level as the EUKLEMS data.

```{r}
crosswalk_euklems_to_nace2 = read_excel("../data/EUKLEMS/crosswalk_euklems_to_nace2.xlsx")

crosswalk_euklems_to_nace2 %>% 
  rename(code = euklems_label) %>% 
  haven::write_dta("orig/crosswalk_euklems_to_nace2.dta")
```

# 4. Prepare employment per region from DFS replication files

To improve precision in the employment by industry distribution across counties, we use the exact total employment per region as registered with social security from Dauth, Findeisen and Suedekum (2017). Since they are directly affiliated at the IAB the have access to 100% of the social security data, whereas we as external researchers were only granted a 2% sample (which is still more than 1 million individuals).

```{r}
replication_data_DFS = readstata13::read.dta13("../data/replication_files_DFS_2017/data/RegionEntries9414.dta") %>% 
  select(year = jahr,
         region = ao_kreis,
         emp_DFS = E_rt) %>% 
  filter(year == 1994) %>% 
  select(-year) %>% 
  # create Goettingen (3159) out of Goettingen and Osterode am Harz (3152, 3156) as they were merged as of 2016:
  mutate(region = case_when(region %in% c(3152, 3156) ~ 3159,
                            TRUE ~ region)) %>% 
  group_by(region) %>% 
  summarise(emp_DFS = sum(emp_DFS))

replication_data_DFS %>% write_dta("orig/employment_DFS.dta", label = NULL)
```
